Resolving Multi-Path Interference in Compressive Time-of-Flight Depth Imaging with a Multi-Tap Macro-Pixel Computational CMOS Image Sensor

Multi-path interference causes depth errors in indirect time-of-flight (ToF) cameras. In this paper, resolving multi-path interference caused by surface reflections using a multi-tap macro-pixel computational CMOS image sensor is demonstrated. The imaging area is implemented by an array of macro-pixels composed of four subpixels embodied by a four-tap lateral electric field charge modulator (LEFM). This sensor can simultaneously acquire 16 images for different temporal shutters. This method can reproduce more than 16 images based on compressive sensing with multi-frequency shutters and sub-clock shifting. In simulations, an object was placed 16 m away from the sensor, and the depth of an interference object was varied from 1 to 32 m in 1 m steps. The two reflections were separated in two stages: coarse estimation based on a compressive sensing solver and refinement by a nonlinear search to investigate the potential of our sensor. Relative standard deviation (precision) and relative mean error (accuracy) were evaluated under the influence of photon shot noise. The proposed method was verified using a prototype multi-tap macro-pixel computational CMOS image sensor in single-path and dual-path situations. In the experiment, an acrylic plate was placed 1 m or 2 m and a mirror 9.3 m from the sensor.


Introduction
Time-of-flight (ToF) [1] depth imaging is a technique for measuring the depth between a camera and objects based on the round-trip time of light emitted from the camera, assuming that the speed of light is constant. ToF has been applied to a variety of applications [2] such as autonomous driving, robot navigation, modeling of objects, and gesture recognition in entertainment due to the small device size, real-time measurement, and the ability to measure texture-less surfaces. However, ToF has a disadvantage in that the measurement accuracy is degraded by multi-path interference [3,4]. There are four major types of multipath interference: (1) surface reflections from transparent objects, (2) multiple reflections between objects, (3) sub-surface scattering, and (4) volumetric scattering caused by bulky weak scattering media such as smoke or fog. In this paper, we focus on surface reflections.
There are two types of ToF image sensors: direct (dToF) [5] and indirect (iToF) [6]. In dToF, impulse light with a duration of nanoseconds is mostly used, and the temporal reflected light waveform is measured pixel by pixel by using a single-photon avalanche diode (SPAD) [7]. Comparison of dToF and iToF is summarized in Table 1. In iToF, on the other hand, the inner products or correlations of the reflected light waveform with multiple time-window functions are measured. Then, the temporal or phase delay is calculated. To perform the inner-product operation, charge modulators are utilized. iToF is further classified into amplitude modulation continuous wave (AMCW) iToF [8] and pulse iToF [9]. AMCW iToF uses sinusoidally modulated light, and pulse iToF uses rectangular pulsed light with a duration of tens of nanoseconds. dToF provides the genuine waveform of the reflected light for many temporal sampling points (typically hundreds of points or more) with a high temporal resolution by using high-precision time-to-digital converters (TDCs). Although separation of multi-path components is easier by dToF than by iToF because the whole waveform of the reflected light is obtained, large digital circuits are required to build the histograms of photon arrival times. iToF is advantageous for achieving a small sensor size because signal detection is performed in the charge domain in pixels, and large digital or analog circuits are required for detection. However, the number of temporal or phase sampling points of the detected signal is only a few, typically three or four. Therefore, multi-path component separation is more difficult in iToF than in dToF. To decompose or compensate for the surface-reflection-type multi-path interference, methods that use multiple modulation frequencies or delays have been demonstrated [10][11][12][13][14]. However, these methods are vulnerable to motion artifacts, since multiple images with some scanning are necessary. Recently, deep-learning-based methods have been emerging [15,16]. Reference [16] proposed Deep ToF, which uses a single depth image taken by an ordinary AMCW iToF camera. Then, the multi-path interference is corrected by deep learning in real time. This method uses information about neighboring pixels or the scene to correct multi-path interference caused by multiple reflections. However, it could be difficult to resolve multi-path interference when neighboring pixels do not provide any cue to correct the multi-path interference. This problem could arise for the case of surface-reflection-type multi-path interference. To achieve motion-artifact-free single-shot imaging acquisition in ToF, specially designed image sensors are necessary.
In our previous study [17,18], we have proposed a method for multi-path component separation using multi-aperture-based temporally compressive pulse iToF. This method requires no large on-chip digital circuits, and the number of temporal sampling points is much more than that of conventional iToF, by taking advantage of the sparsity of the ToF signal. Figure 1 shows a method for separating multi-path components based on compressive sensing [19][20][21][22]. In this method, 15 inner products of the reflected pulse light with binary random shutters are acquired at the same time. Then, 32-bin reflected light histograms are reproduced. Thus, the drawback of iToF, i.e., the limited number of sampling points, is alleviated, and multiple reflected light paths are separated. However, this method requires a custom lens array, so it is not easy to reconfigure the optics. In addition, it is necessary to compensate for the parallax among the lenses. In the framework in [18], the depth resolution is the same as the width of the minimum time window, although sub-window resolution is achieved in ordinary iToF cameras.
We have proposed a multi-tap macro-pixel-based compressive ultra-high-speed CMOS image sensor [23]. This image sensor can implement multi-frequency shutters and sub-clock shifting. Such features are useful for multi-path component separation in a single shot. By using a macro-pixel structure instead of a multi-aperture structure, it is possible to capture images using ordinary single-aperture lenses so that there is no disparity problem. In this paper, we evaluate the potential performance of separating dual-path components based on the multi-tap macro-pixel CMOS image sensor with multiple modulation frequencies and sub-clock shifting by simulation. Then, we demonstrate dual-path component separation in temporally compressive pulse iToF with a prototype image sensor. The depth estimation is performed in two steps: (1) separation of multiple peaks using a compressive sensing solver TVAL3 [24,25] and (2) refining the depths with sub-clock accuracy by fitting the result obtained at step 1 by a nonlinear optimization method [26].
Step 2 enables sub-clockresolution depth estimation, which was not achieved in our previous study. Although this method is not suitable for real-time processing, the purpose of this study is to investigate the potential performance of the proposed scheme. gle shot. By using a macro-pixel structure instead of a multi-aperture structure, it is possible to capture images using ordinary single-aperture lenses so that there is no disparity problem. In this paper, we evaluate the potential performance of separating dual-path components based on the multi-tap macro-pixel CMOS image sensor with multiple modulation frequencies and sub-clock shifting by simulation. Then, we demonstrate dual-path component separation in temporally compressive pulse iToF with a prototype image sensor. The depth estimation is performed in two steps: (1) separation of multiple peaks using a compressive sensing solver TVAL3 [24,25] and (2) refining the depths with sub-clock accuracy by fitting the result obtained at step 1 by a nonlinear optimization method [26].
Step 2 enables sub-clock-resolution depth estimation, which was not achieved in our previous study. Although this method is not suitable for real-time processing, the purpose of this study is to investigate the potential performance of the proposed scheme.
In Section 2, the multi-tap macro-pixel compressive image sensor architecture and depth imaging based on the temporally compressive pulse iToF are described. Section 3 shows system modeling and performance evaluation by simulation. In Section 4, dual-path component separation is demonstrated. In the experiment, a transparent acrylic plate was placed between the camera and an objective mirror to introduce multi-path interference. Section 5 discusses the issues of the proposed method. Section 6 concludes this paper. Figure 1. Separation of multi-path interference components: (a) optical setup that includes a transparent object causing multi-path interference, (b) compression of optical signals with temporal shutters, (c) compressed signals, and (d) reconstructed optical signal.  In Section 2, the multi-tap macro-pixel compressive image sensor architecture and depth imaging based on the temporally compressive pulse iToF are described. Section 3 shows system modeling and performance evaluation by simulation. In Section 4, dual-path component separation is demonstrated. In the experiment, a transparent acrylic plate was placed between the camera and an objective mirror to introduce multi-path interference. Section 5 discusses the issues of the proposed method. Section 6 concludes this paper.  Figure 2 shows the architecture of a multi-tap macro-pixel computational CMOS image sensor. The sensor is mainly composed of a shutter controller, a clock tree, a lateral electric field charge modulator (LEFM) [27,28] driver, an array of macro-pixels, and readout circuits (RDOs). The specifications are summarized in Table 2.
Subpixel count per macro-pixel 2 × 2 Tap count per subpixel 4 Shutter length per tap 8 to 256 bits by 8 bits The sensor was fabricated in a 0.11 µm CMOS image sensor process. A m is composed of 2 × 2 subpixels. Each subpixel is implemented by a four-tap LE has four storages (FD1-4). Charge transfer to the FDs and charge drain is co five gate signals (G1-4 and GD). The shutter pattern of each gate is program shutter length per tap can be set to 8-256 bits in steps of 8 bits. Here, the shut defined as the length of binary shutters and represented by bits. In Figure 1b, code in a unit time is given by one bit. This sensor compresses images when the shutter length is longer than 1 number of the total taps in the macro-pixel or the number of measured signals pixel). The mathematical representation of compressive sensing is briefly me low. When we consider an -dimensional column vector , an × matri -dimensional column vector , the relationship is denoted by = .
Note that is the original input signal, and is the signal measured known measurement matrix . When > , the original signal is comp    The sensor was fabricated in a 0.11 µm CMOS image sensor process. A macro-pixel is composed of 2 × 2 subpixels. Each subpixel is implemented by a four-tap LEFM, which has four storages (FD1-4). Charge transfer to the FDs and charge drain is controlled by five gate signals (G1-4 and GD). The shutter pattern of each gate is programmable. The shutter length per tap can be set to 8-256 bits in steps of 8 bits. Here, the shutter length is defined as the length of binary shutters and represented by bits. In Figure 1b, the shutter code in a unit time is given by one bit.
This sensor compresses images when the shutter length is longer than 16 bits (=the number of the total taps in the macro-pixel or the number of measured signals per macropixel). The mathematical representation of compressive sensing is briefly mentioned below. When we consider an N-dimensional column vector x, an M × N matrix A, and an Mdimensional column vector y, the relationship is denoted by (1) Note that x is the original input signal, and y is the signal measured through a known measurement matrix A. When N > M, the original signal x is compressed. We can reproduce the original input signal x by solving the inverse problem from y and A based on a sparsity constraint. Figure 3 shows the imaging process of a multi-tap macro-pixel computational CMOS image sensor. First, each tap detects the optical signal from the objects using different temporal shutters. Then, different temporally compressed images are obtained. After solving the inverse problem from the temporally compressed images and shutter patterns based on the sparsity, the time-sequential images are reproduced.
sors 2022, 22, x FOR PEER REVIEW can reproduce the original input signal by solving the inverse p based on a sparsity constraint. Figure 3 shows the imaging process of a multi-tap macro-pixel image sensor. First, each tap detects the optical signal from the temporal shutters. Then, different temporally compressed imag solving the inverse problem from the temporally compressed imag based on the sparsity, the time-sequential images are reproduced. For simplicity, the image acquisition and reproduction are exp gle pixel; i.e., is a temporal waveform of the received light for a s shutter is represented by the measurement matrix in compre sumed that is composed of sets of temporal shutters given sional matrix. Then, Equation (1) is rewritten as where each row of describes one temporal sequence of a shutte taps is the same as the number of taps in a subpixel times the nu macro-pixel, namely = 4 × 4 = 16. If the original input signal means only elements have non-zero values and all the other e estimated by the following optimization based on the -norm: Time

Reproduced images
Solve an inverse problem For simplicity, the image acquisition and reproduction are explained based on a single pixel; i.e., x is a temporal waveform of the received light for a single pixel. The coded shutter is represented by the measurement matrix A in compressive sensing. It is assumed that A is composed of M sets of temporal shutters given by the M × N-dimensional matrix. Then, Equation (1) is rewritten as where each row of A describes one temporal sequence of a shutter. The total number of taps is the same as the number of taps in a subpixel times the number of subpixels in a macro-pixel, namely M = 4 × 4 = 16. If the original input signal x is K-sparse, which means only K elements have non-zero values and all the other elements are zero, x is estimated by the following optimization based on the l 0 -norm: Here, x m shows l m -norm of x. Note that l 0 -norm is defined as the number of the non-zero elements in x. However, solving this problem is hard because this is a huge combinatorial problem. Therefore, l 1 -norm minimization is used instead, as follows: Total variation (TV) minimization is widely used to solve l 1 -norm minimization, where the original input signal x is estimated bŷ Here, D i is the differential operator that subtracts an adjacent element from the i-th element.

Modeling of Multi-Path Interference in ToF
The detected light waveform g(t) at the ToF camera, shown in Figure 4c, is modeled as follows: where f (t) (Figure 4a) is the scene response function, L(t) (Figure 4b) is the waveform of the total system response, and * is the convolution operator. In the multi-path scenario, f (t) is modeled as where a i is the amplitude and d i is the depth of the i-th reflection, c is the speed of light, K is the total number of reflections, and δ(x) is the δ-function. L(t) is represented by the convolution of the light source waveform L 0 (t) (Figure 4d) and the sensor response h s (t) (Figure 4e): where is the amplitude and is the depth of the -th reflection, is the speed of light, is the total number of reflections, and ( ) is the -function. ( ) is represented by the convolution of the light source waveform ( ) (Figure 4d) and the sensor response ℎ ( ) (Figure 4e): The sensor response ℎ ( ) is further represented by the convolution of the response of the charge modulator ℎ ( ) and the photodiode response ℎ ( ). When either of these responses is dominant, ℎ ( ) can be approximated by the first-order delay: where is the time constant of the modulator or photodiode. Thus, the detected light waveform ( ) at the ToF camera is shown in Figure 4c and is modeled as The image sensor puts out M compressed signals { } ( = 1, … , ) (Figure 4f), which is also denoted by a vector representation, = ( , … , ) . Note that means the transpose of a vector, . When the shutter pattern for the -th tap is written as ( ), the detected signal is given by the inner product or correlation between ( ) and ( ):  The sensor response h s (t) is further represented by the convolution of the response of the charge modulator h s1 (t) and the photodiode response h s2 (t). When either of these responses is dominant, h s (t) can be approximated by the first-order delay: where τ is the time constant of the modulator or photodiode. Thus, the detected light waveform g(t) at the ToF camera is shown in Figure 4c and is modeled as The image sensor puts out M compressed signals {p m } (m = 1, . . . , M) (Figure 4f), which is also denoted by a vector representation, p = (p 1 , . . . , p M ) T . Note that x T means the transpose of a vector, x. When the shutter pattern for the m-th tap is written as w m (t), the detected signal p m is given by the inner product or correlation between w m (t) and g(t):

Selection of Exposure Patterns
As mentioned in Section 2.1, 16 different shutters are applied to the multi-tap macropixel computational CMOS image sensor. In this paper, 32-bit shutter patterns shown in Figure 5 are used. The shutter patterns of subpixel 1 are shifted by half a clock from those of subpixel 2. Time windows for slower frequencies can increase the depth range, while high-frequency time windows improve the depth resolution. By using half-clock shifting, the depth resolution can be improved by a factor of two without shortening the width of the minimum time window or increasing the operating frequency of the sensor.

Selection of Exposure Patterns
As mentioned in Section 2.1, 16 different shutters are applied to the mu pixel computational CMOS image sensor. In this paper, 32-bit shutter patte Figure 5 are used. The shutter patterns of subpixel 1 are shifted by half a clo of subpixel 2. Time windows for slower frequencies can increase the depth high-frequency time windows improve the depth resolution. By using halfthe depth resolution can be improved by a factor of two without shortenin the minimum time window or increasing the operating frequency of the sen When ordinary four-tap pulse iToF images sensors [29] are used for scene, the shutter patterns only for one frequency can be applied at once, e. patterns for subpixel 4 in Figure 5. They cannot resolve the multi-path inter reflections are detected in the same time windows. It is possible for them t shutter patterns in Figure 5. However, image acquisition should be perform for each set of the shutter patterns for each subpixel. Thus, this implementat from motion artifacts. The benefit of our multi-tap macro-pixel computation age sensor is single-shot image acquisition. A drawback is that the macroin size than ordinary four-tap pixels.

Solving the Inverse Problem and Depth Refinement
The depth estimation process is pixelwise and composed of two stage that there are two reflections: objective light and interference light. This alg easily extended to more reflections. When ordinary four-tap pulse iToF images sensors [29] are used for the dual-path scene, the shutter patterns only for one frequency can be applied at once, e.g., the shutter patterns for subpixel 4 in Figure 5. They cannot resolve the multi-path interference if two reflections are detected in the same time windows. It is possible for them to emulate the shutter patterns in Figure 5. However, image acquisition should be performed four times for each set of the shutter patterns for each subpixel. Thus, this implementation will suffer from motion artifacts. The benefit of our multi-tap macro-pixel computational CMOS image sensor is single-shot image acquisition. A drawback is that the macro-pixel is larger in size than ordinary four-tap pixels.

Solving the Inverse Problem and Depth Refinement
The depth estimation process is pixelwise and composed of two stages. We assume that there are two reflections: objective light and interference light. This algorithm can be easily extended to more reflections.
First, we reconstruct the scene response function denoted byx from the sensor output p, which is a measured version of y, and the measurement matrix A. x is a discrete version of f (t), and A describes the shutter patterns {w m (t)} and the total system response L(t). Next, we find peaks inx and obtain K peak depths {d i } and their amplitudes {a i } (i = 1, . . . , K). When two adjacent peaks are too close, only one peak is found. To overcome this problem, we add two peaks at both ends of the highest peak that is most likely to include the objective light. Their amplitudes are half of the peak. Then, for all combinations of two arbitrary peaks among the K + 2 peaks, their depths and amplitudes are refined by a nonlinear search algorithm. Refinement is performed by minimizing the following evaluation function for two peaks i and j (i, j = 1, . . . , K + 2, i = j): Here, the amplitudes of the sensor output are normalized. F is the forward problem function and provides a simulated sensor output for a, d i , and d j , where a is the ratio of the two amplitudes. The sensor output is normalized by the sum of the elements, i.e.,

Simulation Method
The simulation flow is shown in Figure 6. For simplicity, only photon shot noise is considered, whereas sensor random noise is neglected. The shot noise is given as follows: First, we determine the total number of photons for the M taps, which is given by N op , to define the intensity of the reflected light or signal level. The sensor output p is scaled to make M ∑ m=1 p m equal to N op . Then, shot noise is given to each scaled p m based on the Poisson distribution to generate p electrons. The depth and amplitude are estimated from p by the method explained above. This process is repeated R times to determine the precision and accuracy of the estimated depths, which are evaluated by using the relative standard deviation (RSD) and relative mean error (or simply error). Note that RSD and error refer to precision (or uncertainty) and accuracy (or nonlinearity), respectively. We compare these indexes for different N op 's and a's.
As for the multi-path interference, single-path and dual-path scenarios are simulated. The single-path situation is essential in the depth estimation. It is necessary to test if the proposed algorithm correctly works for this simplest case. The resolution of multi-path interference in the dual-path situation is the main issue in this study. It is expected that the estimated depth of the objective light becomes erroneous as the number of photons decreases or the amplitude of the interference light increases. With the depth of the objective light fixed, the dependency of the estimated depths for both objective and interference light on the depth of the interference light is investigated.

Single Path
First, we simulated single paths (with no interference light) tions: was 5000, 10,000, 20,000, or 40,000. The amplitude r pendent decay of the number of photons was not considered. T from 1 to 32 m in steps of 1 m. The simulation results in Figure were very small over the range. It was confirmed that the propos mally even for the single-path scenario.
Then, simulation was conducted for a realistic case consider decay of light intensity. Figure 8 shows the simulation results. increased because the shot noise relatively increased. The wavef lating a little, but mostly constant. No nonlinearity was observed For the number of photons, two different cases are discussed. In reality, the number of the received photons at the ToF image sensor decays in inverse proportion to the square of the depth of the object. In the simplest case, this decay is not considered to investigate the natural behavior of the proposed algorithm. Then, the decay of the received photon number is considered.
The simulation conditions were as follows: The clock frequency for the shutter generator was 73 MHz. Therefore, the unit time was 13.7 ns which corresponded to approximately 2 m in depth. The measurable depth range was 32.64 m for the shutter length of 32 bits. The time step in the simulations was 0.05 ns (≈0.75 cm in depth). The time constant τ of the system response in Equation (9) was set to 1 ns. The duration of the laser pulse was as long as the shortest time window, namely 13.7 ns. R was set to 100.
To generate the shot noise based on the Poisson distribution, the MATLAB function imnoise was used. TVAL3 was used as a compressive sensing solver [24,25]. The parameters of TVAL3 were as follows: nonneg was true, µ was 2 4 , β was 2 −2 , µ 0 was 2 1 , β 0 was 2 −3 , and other parameters were the default values. To find peaks, findpeaks in MATLAB Signal Processing Toolbox was used. To refine the peak positions, fminsearchbnd [26] was used to apply non-negative constraints.

Single Path
First, we simulated single paths (with no interference light) for the following conditions: N op was 5000, 10,000, 20,000, or 40,000. The amplitude ratio a was 1. Depthdependent decay of the number of photons was not considered. The depth d was varied from 1 to 32 m in steps of 1 m. The simulation results in Figure 7 show that depth errors were very small over the range. It was confirmed that the proposed method worked normally even for the single-path scenario.
Then, simulation was conducted for a realistic case considering the dep decay of light intensity. Figure 8 shows the simulation results. As de increased because the shot noise relatively increased. The waveforms of RS lating a little, but mostly constant. No nonlinearity was observed, as shown  Then, simulation was conducted for a realistic case considering the depth-dependent decay of light intensity. Figure 8 shows the simulation results. As N op decreased, RSD increased because the shot noise relatively increased. The waveforms of RSD were oscillating a little, but mostly constant. No nonlinearity was observed, as shown in Figure 8b.

Dual-Path
In the dual-path scenario, two situations were considered. In Type

Dual-Path
In the dual-path scenario, two situations were considered. In Type A, the total photon number was changed while the amplitude was fixed. In Type B, the amplitude was changed with the total photon number fixed. All the conditions are shown in Table 3. d 1 and a 1 are for the objective light. d 2 and a 2 are for the interference light. The reflection's depth d 1 and amplitude a 1 were always 16 m and 1, respectively. The interference depth d 2 was moved from 1 m to 32 m in 1 m steps. Because this is a numerical investigation, the interference light can come after the objective light.
Firstly, simulations without the depth-dependent decay of the number of photons were performed. Figure 9 shows the results of the simulation. Both the relative error and RSD of d 1 (objective reflection) became large as N op decreased or a 2 increased, especially when the objective light and interference light were merged. As shown in the results of Type A (Figure 9a-d), the absolute value of the relative mean error for different light intensities was <1.2% in both d 1 and d 2 for all given N op 's except when d 2 was around 1 m or 16 m. The results of Type B are shown in Figure 9e-h. The absolute value of the relative mean error for different amplitudes of the interference light was <0.8% in both d 1 and d 2 except when d 2 was around 1 m, 16 m, or 26 m. The reason why the depth error at around 1 m is relatively large is that the relative errors become large as the distance becomes small. Other issues related to the large errors are discussed below. Figure 10 shows the simulation results for Type B when the depth-dependent decay of photons was considered. N op was adjusted to be 5000 when the depth was 4 m and the amplitude was 1, and it was in inverse proportion to the square of depth. RSD and error of d 1 became large as the contribution of the interference light increased. The behavior of RSD and error of d 2 is interesting. As the amplitude of the interference light becomes small, both RSD and error become large. At far distances, the nonlinearity of d 2 for a 2 = 0.1 is significantly large. This is possibly because d 2 becomes closer to d 1 = 16 m.

Experimental System
In this section, the aforementioned method is demonstrated with a prototype multitap macro-pixel computational CMOS image sensor shown in Table 2 and Figure 2. The experimental system setup is shown in Figure 11. A semiconductor pulsed laser (Tama

Experimental System
In this section, the aforementioned method is demonstrated with a prototype multitap macro-pixel computational CMOS image sensor shown in Table 2 and Figure 2. The experimental system setup is shown in Figure 11. A semiconductor pulsed laser (Tama Electric, Hamamatsu, Japan, Model LDS-320, λ = 850 nm) emitted light with a duration of 21 ns toward the objective mirror and the acrylic plate through a mirror having a hole in the middle (holed mirror) to realize a semi-coaxial configuration. The distance between the acrylic plate and the holed mirror was 1 or 2 m. The objective mirror was placed 9.3 m away from the holed mirror. The 16 compressed images for the 16 32-bit shutters shown in Figure 5 were acquired in a single shot. The shutter controller frequency was set to 73 MHz. In this experiment, no imaging lens was used. The objective and interference light directly illuminated the image sensor. In solving the inverse problem, it was assumed that the subpixels in the same macro-pixel were uniformly illuminated. The saturation level of the sensor was 67558 LSB. The read noise of the sensor was 376 LSB. The correlation of the total system response of the ToF camera syste shutters, which was equivalent to the sensor output , was measured b aging. The measurement was performed while the emission timing de laser was scanned over the measurable range. The measured functio follows: The measured responses shown in Figure 10 were used in every s problem solving. The measurement matrix was generated by { ′ ( )}, and the sensor output in the refinement stage was given by the the sensor output in Figure 12 for different times of flight (= light pulse = ′ ( ) ( ) . The correlation of the total system response of the ToF camera system and the coded shutters, which was equivalent to the sensor output p, was measured before the ToF imaging. The measurement was performed while the emission timing delay of the pulsed laser t was scanned over the measurable range. The measured functions are denoted as follows: The measured responses shown in Figure 10 were used in every step of the inverse problem solving. The measurement matrix A was generated by down-sampling {p m (t)}, and the sensor output in the refinement stage was given by the superposition of the sensor output in Figure 12 for different times of flight (=light pulse delays) such as { ′ ( )}, and the sensor output in the refinement stage was given by the superpo the sensor output in Figure 12 for different times of flight (= light pulse delays) su = ′ ( ) ( ) .  Figure 13 shows 16 temporally compressed images when the objective mirror and the acrylic plate were placed at 9.3 m and 1 m, respectively. For comparison, single-path scenes were also measured. Figure 13a,b show the captured images for the single-path scenes. The former is for only the acrylic plate, and the latter is for only the mirror. Figure 13c is for the dual-path situation. Two reflections from both acrylic plate and mirror are overlapped. Note that the interference light from the acrylic plate illuminated the bottom half of the image sensor. Figure 14a,b show the reproduced depth images (33 × 52 pixels) for the acrylic plate placed at 1 m and the mirror placed at 9.3 m, respectively. The whole image sensor was illuminated. Figure 14c,d shows the depths reproduced from the images in Figure 13 for the dual-path scenario.

R REVIEW
14 of 18 Figure 13 shows 16 temporally compressed images when the objective mirror and the acrylic plate were placed at 9.3 m and 1 m, respectively. For comparison, single-path scenes were also measured. Figure 13a and Figure 13b show the captured images for the single-path scenes. The former is for only the acrylic plate, and the latter is for only the mirror. Figure 13c is for the dual-path situation. Two reflections from both acrylic plate and mirror are overlapped. Note that the interference light from the acrylic plate illuminated the bottom half of the image sensor. Figure 14a,b show the reproduced depth images (33 × 52 pixels) for the acrylic plate placed at 1 m and the mirror placed at 9.3 m, respectively. The whole image sensor was illuminated. Figure 14c,d shows the depths reproduced from the images in Figure 13 for the dual-path scenario.

Measured and Processed Results
The total processing time for the dual-path scene was 74 min. The percentages of the processing time for the first stage (coarse depth estimation by TVAL3) and the second stage (depth refinement by fminsearchbnd) were 0.7% and 99.3%, respectively. Specifications of the PC are as follows: CPU: Intel Core i7-9700 (3 GHz, eight cores), RAM: 16 GB. Signal processing was conducted on MATLAB Version 9.3.0.713570. Figure 13. Captured images in the dual-path situation for the objective mirror and the acrylic plate placed at 9.3 m and 1 m, respectively. One hundred images are averaged. Figure 15 shows the depth histograms for the single-path and dual-path situations. Table 4 summarizes the mean peak positions of the histograms, where the pixels with  Figure 15 shows the depth histograms for the single-path and dual-path situations. Table 4 summarizes the mean peak positions of the histograms, where the pixels with pixel values more than half of the maximum pixel value are considered. As shown in the table, multiple depth components were separated with an absolute error less than 0.2 m compared with the depths acquired in the single-path scenario.  The total processing time for the dual-path scene was 74 min. The percentages of the processing time for the first stage (coarse depth estimation by TVAL3) and the second stage (depth refinement by fminsearchbnd) were 0.7% and 99.3%, respectively. Specifications of the PC are as follows: CPU: Intel Core i7-9700 (3 GHz, eight cores), RAM: 16 GB. Signal processing was conducted on MATLAB Version 9.3.0.713570. Figure 15 shows the depth histograms for the single-path and dual-path situations. Table 4 summarizes the mean peak positions of the histograms, where the pixels with pixel values more than half of the maximum pixel value are considered. As shown in the table, multiple depth components were separated with an absolute error less than 0.2 m compared with the depths acquired in the single-path scenario.

Limitations
In the simulation, one of the drawbacks of the proposed method is a relatively large depth error when the two light reflections are very close. In Figure 8, the depth error of the interference light is significantly large between 14 m and 18 m. This can happen when the two pulses are merged. The most effective way to improve the separation is to speed up the operating frequency of the image sensor to shorten the minimal time window duration. Because the sensor is still a prototype, it has problems such as low photosensitivity, large pixel size, and slow operating frequency.
In Figure 9e-h, a huge error is observed at 26 m. At this distance, d 1 and d 2 are estimated to be 17.96 m and 24.39 m when no shot noise is considered, although they should be 16 m and 26 m, respectively. Figure 16 compares the simulated sensor output p for d 1 = 16 m, d 2 = 26 m, and d 1 = 17.96 m, d 2 = 24.39 m. The signals are similar in terms of the peak positions. Therefore, the solution could converge to a local minimum. A depth refinement method or better shutter patterns that can find the global minimum should be explored.

Limitations
In the simulation, one of the drawbacks of the proposed method is a relatively large depth error when the two light reflections are very close. In Figure 8, the depth error of the interference light is significantly large between 14 m and 18 m. This can happen when the two pulses are merged. The most effective way to improve the separation is to speed up the operating frequency of the image sensor to shorten the minimal time window duration. Because the sensor is still a prototype, it has problems such as low photosensitivity, large pixel size, and slow operating frequency.
In Figure 9e-h, a huge error is observed at 26 m. At this distance, and are estimated to be 17.96 m and 24.39 m when no shot noise is considered, although they should be 16 m and 26 m, respectively. Figure  In case 2, another ambiguity arises. Because the time window duration for T j2 is twice longer than that for T j1 , there are two choices as follows: It is also necessary for the precision and accuracy to be small enough to determine the combinations of the detected signals for the multiple frequencies. To improve the capability of multi-path resolution, it is effective to increase the number of taps of the charge modulator. However, trade-off among the pixel size, response time, and separability should be considered.
Although only surface reflection was incorporated in the mathematical model in this paper, we believe that the proposed method could be extended to compensate multiple reflections and volumetric scattering because they significantly increase the path length or time of flight of light. The most difficult point is that the reflected light waveform is not the same as that of the emitted light. Probably, spatio-temporal point spread functions for these multi-path components should be also estimated. On the other hand, separation of sub-surface scattering could be more challenging and requires different techniques other than the application of temporal shutter patterns because the light intensity caused by the sub-surface scattering decays very quickly, e.g., in less than hundreds of picoseconds.
Improvement of the algorithm is also necessary. Although it is assumed that the incident light signal is uniform over a macro-pixel, we should consider the point spread function to realize subpixel resolution. Development of a non-iterative real-time signal processing algorithm is also key. In this paper, our aim was to investigate the potential of the multi-tap macro-pixel computational CMOS image sensor in multi-path separation. As shown in Section 4.2, the processing time for the depth refinement stage occupies 99.3% of the total processing time. Most of the processing time was dedicated to the nonlinear optimization. As the next step, algorithms suitable for real-time processing should be explored [30].

Conclusions
In this paper, the separation of multi-path components in time-of-flight depth imaging using a multi-tap macro-pixel computational CMOS image sensor was demonstrated. The macro-pixel has 2 × 2 subpixels, and each subpixel is implemented by a four-tap lateral electric field charge modulator (LEFM). Thus, 16 images for different temporal shutters are acquired in a single shot. Multiple frequency-time windows and sub-clock shifting were adopted to separate the multi-path interference components and to achieve both long distance range and high distance resolution. The following two-step depth estimation was used to investigate the potential of the proposed scheme for multi-path component separation. In the simulation without depth-dependent decay of the number of photons, the absolute value of the relative mean error was <1.2% for both d 1 (objective depth) and d 2 (interference depth), except when d 2 was very small (at 1 m) or two reflections were merged (at 16 m). When the amplitudes of the two reflections were the same, the error at d 2 = 26 m became large, probably because the solution could converge to a local minimum. For the simulation considering the depth-dependent decay of light intensity, the nonlinearity of the interference light increased significantly at long distances, probably under the effect of the strong objective light. In the experiment, we separated two reflections from an acrylic plate placed at 1 m or 2 m and a mirror placed at 9.3 m. The absolute value of the mean error compared with the single-path situation was less than 0.2 m.